Final Project

Name: Accidents Severity Analysis and Prediction

Project Type: Analysis (DSC 478)

Team Members: Di Han, Wanshu Wang

Data Pre-processing - missing data

The following columns have missing data (values are the percentage of missing entries):

Column                   Missing (%)
Number                   69.000715
Precipitation(in)        33.675953
Wind_Chill(F)            29.637007
Wind_Speed(mph)           8.499773
Humidity(%)               3.001786
Visibility(mi)            2.916170
Weather_Condition         2.902714
Temperature(F)            2.838469
Wind_Direction            2.760965
Pressure(in)              2.392643
Weather_Timestamp         1.996222
Airport_Code              0.280199
Timezone                  0.151841
Zipcode                   0.061673
Sunrise_Sunset            0.005475
Civil_Twilight            0.005475
Nautical_Twilight         0.005475
Astronomical_Twilight     0.005475
City                      0.005475
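A minimal sketch of how such a missing-data summary could be computed with pandas. The helper name missing_percentages and the toy DataFrame are illustrative assumptions, not the project's actual code or data.

```python
import numpy as np
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first.

    Columns without any missing value are left out.
    """
    pct = df.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


# Toy frame standing in for the accidents data (values are made up).
toy = pd.DataFrame({
    "Number": [np.nan, np.nan, np.nan, 12.0],
    "City": ["Chicago", None, "Dayton", "Dayton"],
    "Severity": [2, 3, 2, 4],
})

result = missing_percentages(toy)
print(result)
```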

We deleted Number because it holds address information, and many other features (Zipcode, Street, etc.) already carry address information. We will drop more features of this kind to avoid redundancy.

For Precipitation(in), we did not want to simply fill in the mean or median value, because precipitation is related to humidity, visibility, weather condition, pressure and many other weather-related features. We therefore dropped this feature, since the remaining weather-related features provide this information.

Similarly, we dropped Wind_Chill(F).

We fill missing Zipcode values with the Zipcode of records that have the same City and State pair.
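One way this fill could look in pandas is sketched below. Picking the most frequent Zipcode of each (City, State) pair is an assumption, since a city can have several zip codes; the function name fill_zipcode and the toy rows are hypothetical.

```python
import pandas as pd


def fill_zipcode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Zipcode with the most frequent Zipcode of the same (City, State)."""
    known = df.dropna(subset=["Zipcode"])
    # One representative Zipcode per (City, State) pair (assumption: the mode).
    lookup = known.groupby(["City", "State"])["Zipcode"].agg(lambda s: s.mode().iloc[0])
    # Look up the representative value for every row; unknown pairs give NaN.
    keys = pd.MultiIndex.from_frame(df[["City", "State"]])
    fallback = pd.Series(lookup.reindex(keys).to_numpy(), index=df.index)
    out = df.copy()
    out["Zipcode"] = out["Zipcode"].fillna(fallback)
    return out


# Toy rows (made up); Reno has no known Zipcode, so it stays missing.
toy = pd.DataFrame({
    "City": ["Dayton", "Dayton", "Dayton", "Austin", "Austin", "Reno"],
    "State": ["OH", "OH", "OH", "TX", "TX", "NV"],
    "Zipcode": ["45402", "45402", None, "78701", None, None],
})

filled = fill_zipcode(toy)
print(filled)
```

Rows whose City and State pair never appears with a known Zipcode remain missing and would need a separate decision (for example, dropping them).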

For Wind_Speed(mph), Humidity(%) and Visibility(mi), we fill missing values with the mean value for the same month in the same city or state.
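The group-wise mean fill can be sketched as follows. This assumes a Month column has already been derived from the accident start time, and that the city-level mean is tried first with the state-level mean as a fallback; both choices, along with the function name, are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_group_mean(df: pd.DataFrame, cols, month_col: str = "Month") -> pd.DataFrame:
    """Fill missing values with the monthly mean of the city, then of the state."""
    out = df.copy()
    for col in cols:
        # Means are computed on the original data so imputed values do not leak in.
        city_mean = df.groupby(["State", "City", month_col])[col].transform("mean")
        state_mean = df.groupby(["State", month_col])[col].transform("mean")
        out[col] = df[col].fillna(city_mean).fillna(state_mean)
    return out


# Toy rows (made up). Akron has no known humidity in month 1,
# so it falls back to the Ohio mean for that month.
toy = pd.DataFrame({
    "State": ["OH", "OH", "OH", "OH", "OH"],
    "City": ["Dayton", "Dayton", "Dayton", "Akron", "Toledo"],
    "Month": [1, 1, 1, 1, 1],
    "Humidity(%)": [80.0, 60.0, np.nan, np.nan, 100.0],
})

filled = fill_with_group_mean(toy, ["Humidity(%)"])
print(filled)
```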

We will also drop two more columns: Weather_Timestamp (1.996222% missing) and Airport_Code (0.280199% missing).

Looking at Weather_Condition, we can see that it has many distinct values, so it is better to reduce the number of unique conditions.
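A possible way to collapse the many raw conditions into a few coarse groups is keyword matching, sketched below. The keyword rules and category names are hypothetical and would need to be adapted to the values actually present in Weather_Condition.

```python
import pandas as pd

# Hypothetical keyword -> coarse category rules; the first match wins.
RULES = [
    ("thunder", "Thunderstorm"),
    ("snow", "Snow"),
    ("sleet", "Snow"),
    ("rain", "Rain"),
    ("drizzle", "Rain"),
    ("fog", "Fog"),
    ("mist", "Fog"),
    ("haze", "Fog"),
    ("cloud", "Cloudy"),
    ("overcast", "Cloudy"),
    ("fair", "Clear"),
    ("clear", "Clear"),
]


def simplify_condition(value):
    """Map a raw weather string to a coarse category; missing values stay missing."""
    if pd.isna(value):
        return value
    text = str(value).lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return "Other"


# Toy values (made up) standing in for the Weather_Condition column.
raw = pd.Series([
    "Fair", "Clear", "Mostly Cloudy", "Partly Cloudy", "Overcast",
    "Light Rain", "Heavy Rain", "Light Snow", "Fog", "Haze",
    "Light Rain with Thunder", None,
])

simplified = raw.map(simplify_condition)
print(raw.nunique(), "->", simplified.nunique())
```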

We drop Timezone because it is location information that the address features already cover. We also drop Sunrise_Sunset, Civil_Twilight, Nautical_Twilight and Astronomical_Twilight.

Data Exploration

California (CA) is the most populated state, followed by Texas (TX) and Florida (FL). All three are also among the top 5 states by number of accidents. Oregon (OR) ranks 3rd in number of accidents although it is only the 27th most populated state in the US.

Interestingly, most accidents happen on days with Fair weather, followed by days that are Mostly Cloudy. Most accidents happen on days with temperatures between 50°F and 75°F (10°C and 24°C).

We can see that the top 5 weather conditions are fairly good. None of them is an extreme weather condition such as rain, snow or fog.

From the matrix we can see that the start and end GPS coordinates of the accidents are highly correlated.

In fact, from the distances shown earlier, the end of the accident is usually close to the start, so we can keep just one of the two coordinate pairs for the machine learning models.

Moreover, the wind chill is strongly correlated with the temperature, so we can also drop one of them.

We can also see that the presence of a traffic signal is slightly correlated with the severity of an accident, which may mean that traffic lights help the traffic flow when an accident occurs. From the matrix we can also note that the correlation with Turning_Loop could not be computed, because it is always False and therefore has zero variance.
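The Turning_Loop behaviour can be reproduced on a toy frame: a column that never changes has zero variance, so its Pearson correlation with any other column is undefined and pandas reports NaN. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): end latitude is almost identical to start latitude,
# and Turning_Loop is constant.
toy = pd.DataFrame({
    "Start_Lat": [39.86, 39.93, 39.06, 41.10, 34.05],
    "End_Lat": [39.87, 39.93, 39.07, 41.11, 34.05],
    "Severity": [2, 3, 2, 4, 2],
    "Turning_Loop": [False, False, False, False, False],
})

numeric = toy.astype(float)   # booleans become 0.0 / 1.0
corr = numeric.corr()         # Pearson correlation matrix
cov = numeric.cov()

print(corr.round(3))
# The covariance with a constant column is 0, but the correlation divides
# by its standard deviation (0), so it comes out as NaN.
print(cov.loc["Severity", "Turning_Loop"], corr.loc["Severity", "Turning_Loop"])
```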

Handle unbalanced data

Synthetic Minority Oversampling Technique

Due to the imbalance of this dataset mentioned in section 2.1, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset, using the SMOTE method from the imblearn package. After resampling, the four severity levels are balanced. There are 4849528 observations after SMOTE, compared to 1516064 before.
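To illustrate the idea behind SMOTE, here is a small from-scratch sketch in NumPy. It is not imblearn's implementation; in imblearn the corresponding call is SMOTE().fit_resample(X, y) from imblearn.over_sampling. The function name smote_sketch and the toy data are assumptions, and the pairwise distance matrix makes this suitable for toy sizes only.

```python
import numpy as np


def smote_sketch(X, y, k=5, seed=0):
    """Oversample every class to the size of the largest class by interpolating
    between a sample and one of its k nearest same-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0 or count < 2:
            continue  # nothing to add, or no neighbour to interpolate with
        Xc = X[y == cls]
        # Pairwise Euclidean distances within the class (O(n^2) memory).
        dists = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
        kk = min(k, count - 1)
        neighbours = np.argsort(dists, axis=1)[:, :kk]
        base = rng.integers(0, count, size=n_new)
        picked = neighbours[base, rng.integers(0, kk, size=n_new)]
        gap = rng.random((n_new, 1))  # position along the connecting segment
        X_parts.append(Xc[base] + gap * (Xc[picked] - Xc[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data (made up): four severity levels with very different counts.
rng = np.random.default_rng(42)
y = np.repeat([1, 2, 3, 4], [6, 40, 10, 3])
X = rng.normal(size=(len(y), 3)) + y[:, None]

X_bal, y_bal = smote_sketch(X, y)
print(np.unique(y_bal, return_counts=True))
```

The original rows are kept unchanged at the top of the output, and only synthetic minority rows are appended.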

Data Preparation

Clustering

Next, we perform K-Means clustering (using the K-Means implementation in scikit-learn) on the balanced data. Since there are four severity levels, we use K = 4. Euclidean distance is the distance measure for the clustering. We print the cluster centroids and compare the four clusters to the four severity levels by computing the completeness and homogeneity values of the generated clusters.

The centroids provide an aggregate representation and a characterization of each cluster.
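A minimal sketch of this clustering step with scikit-learn, assuming four clusters to match the four severity levels used in the Discussion. The synthetic blobs stand in for the prepared accident features, and the StandardScaler step is an assumption about the normalization; KMeans in scikit-learn uses Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score, homogeneity_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 groups play the role of the 4 severity levels.
X, labels = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Print the centroids with limited precision so they are easy to read.
np.set_printoptions(precision=3, suppress=True)
print(kmeans.cluster_centers_)

# Compare the clusters with the pre-assigned labels.
completeness = completeness_score(labels, kmeans.labels_)
homogeneity = homogeneity_score(labels, kmeans.labels_)
print(f"completeness={completeness:.4f} homogeneity={homogeneity:.4f}")
```

Both scores lie between 0 and 1. Well-separated synthetic blobs score far higher than real accident data would, so the numbers printed here say nothing about the project's results.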

PCA

We perform PCA on the normalized data matrix. This can be done with the linear algebra package in NumPy or the decomposition module in scikit-learn (the latter is much more efficient). We analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of the variance in the data, then use these r components as features to transform the data into a reduced-dimension space.

We then perform K-Means again, this time on the lower-dimensional transformed data, and compute the completeness and homogeneity values of the new clusters.
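The PCA step and the second clustering run can be sketched as follows, again on synthetic stand-in data rather than the accident features. Here r is chosen as the smallest number of components whose cumulative explained variance reaches 95%; the variable names and the StandardScaler normalization are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 groups and 10 features.
X, labels = make_blobs(n_samples=400, centers=4, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit all components, then find the smallest r reaching 95% of the variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
r = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"r = {r}, variance captured = {cumulative[r - 1]:.4f}")

# Project onto the first r components and cluster in the reduced space.
X_reduced = PCA(n_components=r).fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reduced)

completeness = completeness_score(labels, kmeans.labels_)
homogeneity = homogeneity_score(labels, kmeans.labels_)
print(f"completeness={completeness:.4f} homogeneity={homogeneity:.4f}")
```

scikit-learn also accepts a fraction directly, as in PCA(n_components=0.95), which selects the number of components by explained variance in one step.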

Discussion

We used K-Means with four clusters to group the balanced dataset and compared the resulting clusters with the original groups defined by severity level. We used PCA to reduce the number of features. The clustering results before and after PCA are as follows. The completeness score is 0.014798401142677924 after PCA, compared to 0.01474845550687712 before. The homogeneity score is 0.01352069029332516 after PCA, compared to 0.013482016769888854 before. Both scores are close to 0, so the clustering did not perform well: the clusters do not correspond to the severity levels. PCA improved both scores only very slightly.

From our previous study, we know there is an important feature, Source, which identifies who recorded the accident. Severity reported by different sources may reflect a different underlying impact on traffic. The updated dataset used in this project has had many features removed, including Source, at the request of some departments. Considering all of the above, further research is needed to find an appropriate way to group the accidents.

Acknowledgements

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.